Multilingual Text Processing in a Two-Byte Code

نویسنده

  • Lloyd B. Anderson
چکیده

ABS~ACT National and international standards committees are now discussing a two-byte code for multilingual information processing. This provides for 65,536 separate character and control codes, enough to make permanent code assiguments for all the charanters of ell national alphabets of the world, and also to include Chinese/Japanese characters. This paper discusses the kinds of flexibility required to handle both Roman and non-Roman alp.habets. It is crucial to separate information units (codes) from graphic forms, to maximize processing p ower, Comparing alphabets around the world, we find t.hat the graphic devices (letters, digraphs, accent marks, punctuation, spacing, etc.) represent a very limited number of information units. It is possible to arr_ange alphabet codes to provide transliteration equivalence, the best of three solutions compared as a _eramework for code assignments.

برای دانلود رایگان متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Using Chinese Text Processing Technique for the Processing of Sanskrit Based Indian Languages: Maximum Resource Utilization and Maximum Compatibility

Chinese text processing systems are using Double Byte Coding , while almost all existing Sanskrit Based Indian Languages have been using Single Byte coding for text processing. Through observation, Chinese Information Processing Technique has already achieved great technical development both in east and west. In contrast, Indian Languages are being processed by computer, more or less, for word ...

متن کامل

Leveraging Data-Driven Methods in Word-Level Language Identification for a Multilingual Alpine Heritage Corpus

This paper presents a data-driven, simple cluster-and-label approach using optimized count-based methods for word-level language identification for a large domain-specific multilingual diachronic corpus of periodicals published at least yearly between 1864 and 2014 in Switzerland. Our system requires no annotated data or training, only minimal human effort in evaluating and labeling 50 clusters...

متن کامل

Processing Text Files as Is: Pattern Matching over Compressed Texts, Multi-byte Character Texts, and Semi-structured Texts

Techniques in processing text files “as is” are presented, in which given text files are processed without modification. The compressed pattern matching problem, first defined by Amir and Benson (1992), is a good example of the “as-is” principle. Another example is string matching over multi-byte character texts, which is a significant problem common to oriental languages such as Japanese, Kore...

متن کامل

Multilingual Language Processing From Bytes

We describe an LSTM-based model which we call Byte-to-Span (BTS) that reads text as bytes and outputs span annotations of the form [start, length, label] where start positions, lengths, and labels are separate entries in our vocabulary. Because we operate directly on unicode bytes rather than languagespecific words or characters, we can analyze text in many languages with a single model. Due to...

متن کامل

Multilingual Hybrid Text Processing in Ancient Uighur (Chaghatai) Digitalized System

This research mainly considers and discusses system codepage in special techniques to multilingual processing of ancient Uighur literatures (Chagatai for abbreviation in the following text). Based on detailed analysis to Arabic code page, Farsi codepage and Uighur codepage in Unicode standard, we presented a codepage and keyboard layout, which is compatible with Chaghatai, Arabic, Farsi, Uighur...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 1984